Home Credit Default Risk

GROUP-05-HCDR


Team and project meta information

Members:

members.png


Project Abstract

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition.

The challenge is to construct a model that can predict the level of risk associated with an individual loan. With this project, we intend to use historical loan application data to predict whether or not a borrower will be able to repay a loan.

After the phase 1 and phase 2 implementations, we realized that it was best to add a few more models in order to compare them and find the best-fitting model, as we were facing underfitting and overfitting with the Naive Bayes and Random Forest models. Therefore, for the final phase, we have implemented all of the following in this project:

In phase 1, we faced issues related to data size, unwanted data, and a lack of tuning. In phase 2, our main goal was to add feature engineering and hyperparameter tuning to the phase 1 pipeline. In phase 3, we implemented the other models mentioned, as well as neural networks.

The results for this project are as follows:


Project Description (tasks and data)

Data

Dataset link: https://www.kaggle.com/c/home-credit-default-risk/data

Background of the data

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would either be unable to obtain loans or would fall victim to untrustworthy lenders.

The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Data files overview

There are 7 different sources of data:

home_credit.png

Tasks

Workflow.jpeg

Importing all the necessary python libraries:

Reading the csv data files:
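
A minimal sketch of this step. In the notebook the competition files are read directly with `pd.read_csv("application_train.csv")` and so on; here a `StringIO` buffer stands in for the file so the sketch is self-contained. The column names are real HCDR columns, but the values are illustrative.

```python
import io
import pandas as pd

# A StringIO buffer stands in for application_train.csv; in the notebook,
# this would be: app_train = pd.read_csv("application_train.csv")
csv_text = io.StringIO(
    "SK_ID_CURR,TARGET,AMT_INCOME_TOTAL\n"
    "100002,1,202500.0\n"
    "100003,0,270000.0\n"
)
app_train = pd.read_csv(csv_text)
```

The same pattern applies to the other six files (bureau.csv, previous_application.csv, etc.).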


Exploratory Data Analysis + Feature Engineering

EDA, or exploratory data analysis, is an essential component of any Data Analysis or Data Science project. Essentially, EDA entails analyzing the dataset to identify patterns, anomalies (outliers), and hypotheses based on our understanding of the dataset.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of the dataset's variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate.

Data description using pandas dataframe

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes: data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.

Training data information

Testing data information

Feature Extraction

By creating new features from the existing ones (and then discarding the original features), Feature Extraction attempts to reduce the number of features in a dataset. The new reduced set of features will be able to summarize much of the information that was contained in the original set of features. Thus, an abridged version of the original features can be created by combining them.

In our analysis of the data, we found that there are many missing values. Columns with more than 25% of missing values were removed. Our team checked the columns for the distribution of 0's and removed the columns with 85% of rows with only 0's. In addition, we divided the data into numerical and categorical data. The numerical data was handled by creating an intermediate imputer pipeline in which the missing values were replaced with the mean of the data, while the missing values in categorical missing data were handled by encoding the data based upon OHE (One Hot Encoding) and replacing the missing values with the mode of the columns.

Firstly, let's find the percentage of the missing values in each column:
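
A minimal sketch of the missing-percentage computation, using a toy frame with deliberately missing entries (the column names here are illustrative stand-ins for the application columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0, np.nan, 270000.0, 135000.0],  # 1 of 4 missing
    "OWN_CAR_AGE":      [np.nan, np.nan, 9.0, np.nan],           # 3 of 4 missing
})
# Fraction of nulls per column, scaled to a percentage.
missing_pct = df.isnull().mean() * 100
```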

Using 25% as the missing-value threshold, we extract all the columns whose missing percentage is less than the threshold:

To optimize the data, we check each column for all the zero or null values and if 85% or more of the data in that column is filled with zero or null, we remove that particular column:

Printing all the columns that contain at least 85% of its data as either zero or null:

Dropping all the columns that contain at least 85% of its data as either zero or null:
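
The two dropping rules above can be sketched together on a toy frame (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "keep_col":   [1.0, 2.0, np.nan, 4.0, 5.0],             # 20% missing -> kept
    "sparse_col": [np.nan, np.nan, 3.0, np.nan, np.nan],    # 80% missing -> dropped
    "zero_col":   [0.0, 0.0, 0.0, 0.0, 0.0],                # 100% zeros -> dropped
})

# Step 1: keep columns whose missing percentage is at most 25%.
missing_pct = df.isnull().mean() * 100
df = df[missing_pct[missing_pct <= 25].index]

# Step 2: drop columns where 85% or more of the values are zero or null.
zero_or_null_pct = ((df == 0) | df.isnull()).mean() * 100
df = df.drop(columns=zero_or_null_pct[zero_or_null_pct >= 85].index)
```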

Saving all the training data targets in the numerical dataframe:

Checking the correlation of each numerical feature with the target and keeping those whose correlation magnitude is greater than 3%, whether positive or negative:
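
A sketch of the correlation filter on a toy frame. The feature names are made up: "orthogonal" is constructed to have zero correlation with TARGET, while "strong_pos" clearly moves with it.

```python
import pandas as pd

df = pd.DataFrame({
    "TARGET":     [0, 0, 1, 1, 0, 1],
    "strong_pos": [10, 20, 80, 90, 15, 85],   # high when TARGET is 1
    "orthogonal": [1, -1, 1, -1, 0, 0],       # uncorrelated by construction
})
corr = df.corr()["TARGET"].drop("TARGET")
# Keep features whose absolute correlation with TARGET exceeds 3%.
selected = corr[corr.abs() > 0.03].index.tolist()
```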

Checking for data that contains no missing values in the categorical dataframe:

Dropping all the categorical columns that have missing values:


Visual Exploratory Data Analysis

In order to obtain a deeper understanding of the data, EDA involves generating summary statistics on the numerical data and creating various graphical representations. Data visualization presents textual or numerical data in a visual format, which makes the information it expresses easier to grasp. Since we humans remember pictures more easily than text, Python provides various libraries for data visualization, such as matplotlib, seaborn, and plotly. In this project, we use Matplotlib and seaborn to explore the data through various plots.

Visual EDA on numerical data

Numerical data refers to the data that is in the form of numbers, and not in any language or descriptive form.

Here we plot graphs of some columns that are positively correlated with the target variable and analyze the trends:

Here we plot graphs of some columns that are negatively correlated with the target variable and analyze the trends:

Plotting a heatmap to analyze correlation in the application train dataset:

Plotting heatmap to see correlation in application test dataset

Visual EDA on categorical data

Categorical data refers to a data type that can be stored and identified based on the names or labels given to them.

From this graph we can see whether a person owns realty or not.

From the above graph we can see that the working class is the most common borrowing category.

From the above graph we can see the types of loans people take.

From the above pie chart we can see that married people tend to borrow more money.


Modeling Pipelines

Now comes the fun part. In a statistical sense, models are general rules. Think of machine learning models as tools in your toolbox: you have access to many algorithms and use them to accomplish different goals. The better the features you use, the better your predictive power will be. After cleaning the data and finding the most important features, using your model as a predictive tool will only enhance your decision making.

Collectively, the linear sequence of steps required to prepare the data, tune the model, and transform the predictions is called the modeling pipeline. Modern machine learning libraries like the scikit-learn Python library allow this sequence of steps to be defined and used correctly (without data leakage) and consistently (during evaluation and prediction). A pipeline is a linear sequence of data preparation options, modeling operations, and prediction transform operations.

The modeling pipeline is an important tool for machine learning practitioners. Nevertheless, there are important implications that must be considered when using them. The main confusion for beginners when using pipelines comes in understanding what the pipeline has learned or the specific configuration discovered by the pipeline.

Therefore, for this project we are going to use 3 different modeling pipeline methods to perform home credit default risk prediction and they are:

We will choose the model that gives the best accuracy for the home credit default risk prediction.

We will be using the following pipeline for this project:

IMG_E8FACEAECA3F-1.jpeg

Importing all the necessary python libraries for the different pipelines we are going to use:

Selecting only the columns that we have finally decided for the numerical and the categorical part:

In the pipeline for numerical data (numerical_pipeline), we impute missing values with the mean of the column; in the pipeline for categorical data (categorical_pipeline), we impute missing values with the most frequent value and apply one-hot encoding to deal with the categorical data.

Then we have created a pipeline to merge numerical and categorical pipelines using ColumnTransformer.
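
A minimal sketch of the two pipelines and their merge via ColumnTransformer. The column names are real HCDR columns, but the toy values and the exact pipeline configuration here are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, np.nan, 300000.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", np.nan],
})

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),           # fill with column mean
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fill with the mode
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# ColumnTransformer routes each column group through its own pipeline.
preprocessor = ColumnTransformer([
    ("num", numerical_pipeline, ["AMT_INCOME_TOTAL"]),
    ("cat", categorical_pipeline, ["NAME_CONTRACT_TYPE"]),
])
X = preprocessor.fit_transform(df)  # 1 numeric column + 2 one-hot columns
```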

Saving transformed dataset used for the model training:

We are merging the different dataframes together according to the data diagram shown in the data description section above using primary keys:
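
A sketch of the key-based merge on miniature stand-ins. SK_ID_CURR is the competition's loan-level primary key; "bureau_loan_count" is an illustrative engineered aggregate, and the values are made up.

```python
import pandas as pd

app = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003],
    "AMT_CREDIT": [406597.5, 1293502.5],
})
bureau_agg = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003],
    "bureau_loan_count": [3, 1],  # hypothetical per-applicant aggregate
})

# Left join on the primary key so every application row is preserved.
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
```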

We are merging training dataset with the bureau after cleaning the bureau dataset:

Defining a function to find target correlation with the other features:

We additionally add a few hand-crafted features to the training dataset, as follows:

Finding the correlation between the newly made features and the target feature:

We shortlist all the features whose correlation magnitude with respect to the target is greater than 8%:

Using the shortlisted features for the training dataset and splitting the whole dataset into training and testing datasets:
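
A sketch of the split on stand-in data. The feature names are illustrative, and the 80/20 ratio is an assumption; the notebook may use a different split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-in for the shortlisted feature table.
df = pd.DataFrame({
    "TARGET": rng.integers(0, 2, size=100),
    "credit_income_ratio": rng.normal(size=100),   # hypothetical feature
    "days_employed_pct": rng.normal(size=100),     # hypothetical feature
})
X = df.drop(columns="TARGET")
y = df["TARGET"]
# Stratify so both splits keep the same default/non-default ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```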

We performed the following experiments with the different groups of features we newly created. With hyperparameter tuning and datasets including these features, we calculated the accuracies for the different models, and then identified the best group of features from these experiments.

WhatsApp%20Image%202022-04-19%20at%209.21.33%20PM.jpeg

Hyperparameter Tuning

Link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

When creating a machine learning model, you'll be presented with design choices as to how to define your model architecture. Oftentimes, we don't immediately know what the optimal model architecture should be for a given model, and thus we'd like to be able to explore a range of possibilities. In true machine learning fashion, we'll ideally ask the machine to perform this exploration and select the optimal model architecture automatically. Parameters that define the model architecture are referred to as hyperparameters, and thus this process of searching for the ideal model architecture is referred to as hyperparameter tuning.

We can use GridSearch to tune the hyperparameters. Grid Search uses a different combination of all the specified hyperparameters and their values and calculates the performance for each combination and selects the best value for the hyperparameters. This makes the processing time-consuming and expensive based on the number of hyperparameters involved. In GridSearchCV, along with Grid Search, cross-validation is also performed. Cross-Validation is used while training the model. As we know that before training the model with data, we divide the data into two parts – train data and test data. In cross-validation, the process divides the train data further into two parts – the train data and the validation data.
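
A minimal GridSearchCV sketch. The estimator, grid values, and synthetic data are illustrative assumptions; the notebook tunes over the HCDR features with its own grids.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": [0.01, 0.1, 1.0]}  # illustrative grid
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,               # 3-fold cross-validation on the training portion
    scoring="roc_auc",  # match the competition metric
)
grid.fit(X, y)
best_C = grid.best_params_["C"]  # best combination found by the search
```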

Naive Bayes

Library: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

Naive Bayes is a basic but effective probabilistic classification model in machine learning that draws influence from Bayes Theorem.

Bayes' theorem is a formula that gives the conditional probability of an event A happening given that another event B has already happened. Its mathematical formula is as follows:

image.png

Where

- A and B are two events
- P(A|B) is the probability of event A given that event B has already happened
- P(B|A) is the probability of event B given that event A has already happened
- P(A) is the independent probability of A
- P(B) is the independent probability of B

This theorem can be used to generate the following classification model:

image.png

Where

- X = x1, x2, x3, ..., xN is the list of independent predictors
- y is the class label
- P(y|X) is the probability of label y given the predictors X

The above equation may be extended as follows:

image.png

We are not considering Naive Bayes model from phase 2 onwards because the accuracy given by this model was the least during phase 1.

Logistic Regression

Library: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Logistic regression is a statistical method for predicting binary classes. The outcome or target variable is dichotomous in nature. Dichotomous means there are only two possible classes. For example, it can be used for cancer detection problems. It computes the probability of an event occurrence.

It is a special case of linear regression where the target variable is categorical in nature. It uses a log of odds as the dependent variable. Logistic Regression predicts the probability of occurrence of a binary event utilizing a logit function.

Linear Regression Equation:

image.png

where y is the dependent variable and x1, x2, ..., xn are explanatory variables.

Sigmoid Function:

image.png

Apply Sigmoid function on linear regression:

image.png
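
The composition above can be sketched in a few lines; the coefficients below are made up for illustration, not fitted values.

```python
import math

def sigmoid(z):
    """Squash a linear-regression output into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Linear part z = b0 + b1*x1 for one applicant, with made-up coefficients.
b0, b1 = -1.5, 0.002
z = b0 + b1 * 1000       # z = 0.5
p_default = sigmoid(z)   # probability of the positive class
```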

Properties of Logistic Regression:

- The dependent variable in logistic regression follows the Bernoulli distribution.
- Estimation is done through maximum likelihood.
- There is no R-squared; model fitness is instead assessed through concordance and KS statistics.

Random Forest

Library: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Random forest is a Supervised Machine Learning Algorithm that is used widely in Classification and Regression problems. It builds decision trees on different samples and takes their majority vote for classification and average in case of regression.

Random forest works on the bagging principle. Bagging, also known as bootstrap aggregation, is the ensemble technique used by random forest. Bagging chooses random samples from the dataset: each model is built from a sample (a bootstrap sample) drawn from the original data with replacement, a step known as row sampling with replacement, or bootstrapping. Each model is then trained independently and generates its own result. The final output is based on majority voting after combining the results of all the models, a step known as aggregation.

Steps involved in random forest algorithm:

image.png

ADABoost

Library: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

AdaBoost, or Adaptive Boosting, is an ensemble boosting classifier proposed by Yoav Freund and Robert Schapire in 1996. It combines multiple classifiers to increase accuracy. AdaBoost is an iterative ensemble method: it builds a strong, high-accuracy classifier by combining multiple poorly performing classifiers. The basic concept is to set the weights of the classifiers and re-weight the training samples in each iteration so that unusual observations are predicted accurately. Any machine learning algorithm that accepts weights on the training set can be used as the base classifier. AdaBoost should meet two conditions:

image_1_joyt3x.webp
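
A minimal AdaBoost sketch on synthetic stand-in data (the notebook fits on the prepared HCDR matrix with its own settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Each boosting round re-weights the training samples so that the next
# weak learner (a shallow decision tree by default) focuses on the
# observations the previous learners got wrong.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
preds = ada.predict(X[:5])
```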

Bagging

Library: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html

Bagging is an ensemble machine learning approach that combines the outputs from many learners to improve performance. These algorithms work by drawing subsets of the training set, running each subset through a machine learning model, and then combining the models' predictions to generate an overall prediction for each instance in the original data.

Bagging is commonly used in machine learning for classification problems, particularly when using decision trees or artificial neural networks as the base learners. It has been applied to various machine learning algorithms, including decision stumps, artificial neural networks (including the multi-layer perceptron), support vector machines, and maximum entropy classifiers. Bagging can also be applied to regression problems, but it has been found to be less effective there than for classification.

The bagging technique is also called bootstrap aggregation. It is a data sampling technique in which data is sampled with replacement. Bootstrap aggregation is a machine learning ensemble meta-algorithm that reduces the variance of an estimate, which improves stability and helps avoid overfitting. The bagging classifier combines the predictions of the different estimators and in turn helps reduce variance.

Screenshot-2020-09-08-at-4.17.30-PM-768x455.png
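
A minimal BaggingClassifier sketch on synthetic stand-in data; the estimator counts and sampling fraction are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Each of the 20 base learners (decision trees by default) is fit on its
# own bootstrap subset of the rows; predictions are combined by voting.
bag = BaggingClassifier(n_estimators=20, max_samples=0.8, random_state=0)
bag.fit(X, y)
preds = bag.predict(X[:5])
```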

XGBoost

Library: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

XGBoost is short for “eXtreme Gradient Boosting.” The “eXtreme” refers to speed enhancements such as parallel computing and cache awareness that makes XGBoost approximately 10 times faster than traditional Gradient Boosting. In addition, XGBoost includes a unique split-finding algorithm to optimize trees, along with built-in regularization that reduces overfitting. Generally speaking, XGBoost is a faster, more accurate version of Gradient Boosting.

Boosting performs better than bagging on average, and Gradient Boosting is arguably the best boosting ensemble. Since XGBoost is an advanced version of Gradient Boosting, and its results are unparalleled, it’s arguably the best machine learning ensemble that we have.

Screen%20Shot%202022-05-01%20at%203.40.46%20AM.png
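
Since the library linked above is scikit-learn's GradientBoostingClassifier (a stand-in for the dedicated `xgboost` package), a minimal gradient-boosting sketch on synthetic data looks like this; the hyperparameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Trees are added sequentially; each new tree fits the residual errors of
# the current ensemble, damped by the learning rate.
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=0
)
gb.fit(X, y)
preds = gb.predict(X[:5])
```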


Leakage

Data leakage can cause you to create overly optimistic, if not completely invalid, predictive models. Data leakage occurs when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know, and in turn invalidate the estimated performance of the model being constructed.

It is a serious problem for at least 3 reasons:

As machine learning practitioners, we are primarily concerned with this last case.

Do I have Data Leakage?

An easy way to know you have data leakage is if you are achieving performance that seems a little too good to be true. For the pipeline that we have used, we see that there is no data leakage, as we have dealt with all NaN values appropriately and the data types have been set uniformly across columns. We have also ensured that the new features generated during feature engineering are used appropriately during training and are available at the time of inference.


Results and Discussion

In industry, we consider different kinds of metrics to evaluate our models. The choice of metric depends entirely on the type of model and its implementation plan. After building a model, multiple metrics can be used to help evaluate its accuracy.

Metrics

For this project, we are going to use the following performance metrics for each of the training models separately:

image.png

image.png

where:

- y_ij indicates whether sample i belongs to class j or not
- p_ij is the probability of sample i belonging to class j

image.png

Importing all the necessary metrics libraries:
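
The four metrics can be computed with scikit-learn; the toy labels and predicted default probabilities below are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, log_loss, roc_auc_score,
)

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]                  # predicted probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_prob] # hard labels at a 0.5 cutoff

acc = accuracy_score(y_true, y_pred)   # fraction of correct labels
auc = roc_auc_score(y_true, y_prob)    # ranking quality of the probabilities
loss = log_loss(y_true, y_prob)        # penalizes confident wrong probabilities
cm = confusion_matrix(y_true, y_pred)  # 2x2 table of TN/FP/FN/TP
```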

Model Results

Naive Bayes

Logistic Regression

Random Forest

AdaBoost

Bagging

XGBoost

Compiled results

Here we see that the Bagging classifier has the highest training accuracy, at 99.9%, but an accuracy like this carries the risk of overfitting. The XGBoost model's training accuracy is 91.9%; this score is good, and the model appears reliable. The ROC AUC value for the XGBoost model is 0.739, showing a large proportion of true positives and indicating a good fit to the data. As shown in the combined table above, compared to all the other models, XGBoost appears to be the best-fitting model based on ROC-AUC value.

CSV files creation for kaggle submission

Machine learning competitions are a great way to improve your skills and measure your progress as a data scientist. If you are using data from a competition on Kaggle, you can easily submit it from your notebook. We make submissions in CSV files. Submissions usually have two columns: an ID column and a prediction column. The ID field comes from the test data, keeping whatever name it had there (for this competition, SK_ID_CURR). The prediction column uses the name of the target field, TARGET.
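
A minimal sketch of building the submission file. The IDs and probabilities below are made up; a `StringIO` buffer stands in for the file, whereas the notebook would call `submission.to_csv("submission.csv", index=False)`.

```python
import io
import pandas as pd

# HCDR submissions have two columns: SK_ID_CURR and the predicted
# TARGET probability.
submission = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005],
    "TARGET": [0.08, 0.31],
})

buf = io.StringIO()  # stand-in for submission.csv
submission.to_csv(buf, index=False)
```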


Neural Network

Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms. Their name and structure are inspired by the human brain, mimicking the way that biological neurons signal to one another.

Artificial neural networks (ANNs) are composed of node layers: an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to others and has an associated weight and threshold. If the output of an individual node is above the specified threshold value, that node is activated and sends data to the next layer of the network; otherwise, no data is passed along to the next layer.

Neural networks rely on training data to learn and improve their accuracy over time. However, once these learning algorithms are fine-tuned for accuracy, they are powerful tools in computer science and artificial intelligence, allowing us to classify and cluster data at a high velocity. Tasks in speech recognition or image recognition can take minutes versus hours when compared to the manual identification by human experts. One of the most well-known neural networks is Google’s search algorithm.
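
Our actual model is a PyTorch MLP; as an illustration of the layer mechanics described above, here is a NumPy sketch of a single forward pass only (random, untrained weights — all sizes and values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 4 input features -> 3 hidden units -> 1 output probability.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

def forward(x):
    hidden = relu(x @ W1 + b1)        # hidden layer with ReLU activation
    return sigmoid(hidden @ W2 + b2)  # output squashed to (0, 1)

x = rng.normal(size=(2, 4))  # two applicants, four features each
p = forward(x)               # predicted default probabilities
```

Training would adjust W1, b1, W2, b2 by backpropagation, which PyTorch handles for us in the real model.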

ANN%20Diagram.jpeg

Screen%20Shot%202022-05-01%20at%202.09.10%20AM.png

The following are the results achieved through trial and error considering different number of hidden layers:

WhatsApp%20Image%202022-05-01%20at%204.59.21%20AM.jpeg

We can see from the table that as the number of hidden layers increases, the accuracy also increases.


Kaggle submission

The results for our project submissions on Kaggle are as follows:

Note: We did not submit the file for Naive Bayes because the accuracy is way too low to begin with.

LR%20Kaggle.png

RF%20Kaggle.png

Adaboost%20Kaggle.png

Bagging%20Kaggle.png

Screen%20Shot%202022-05-01%20at%204.35.04%20AM.png

NN%20Kaggle.png

Summary of kaggle submission:

Best model: XGBoost


Conclusions

We are attempting to predict whether the credit-less population will be able to repay their loans. We sourced our data from the Home Credit dataset in order to realize this goal. Having a fair chance to obtain a loan is extremely important to this population, and as students we feel a strong connection to it; as a result, we decided to pursue this project. During the first phase, we began to experiment with the dataset. After performing OHE on the data, we used imputation techniques to clean it before feeding it into the model. In phase 2, we implemented feature engineering and hyperparameter tuning to refine the results.

Finally, we evaluated the results using accuracy score, log loss, confusion matrix, and ROC AUC scores. In this last phase we added a few more models (AdaBoost, XGBoost, and bagging) to the previously used ones, namely logistic regression, random forest, and Naive Bayes. Additionally, we implemented a multilayer perceptron (MLP) model using PyTorch for loan default classification. We found the training accuracy for the MLP model to be 91.92% and the test accuracy to be 91.96%, which is quite close to our previous non-deep-learning models. Deep learning models require a huge amount of data to train, so in the longer run deep learning models may work best for HCDR classification compared to the usual supervised models.

The best fitting model is XGBoost with the following scores:

The future scope of this project could include using embeddings in deep learning models, or using advanced classification models such as LightGBM and other boosting models that may produce better results. The features can also be refined further to increase the accuracy of the model.


References